
Science Inventory

A Machine Learning Approach to Predict Watershed Health Indices for Sediments and Nutrients at Ungauged Basins

Citation:

Mallya, G., Mohamed M. Hantush, AND R. Govindaraju. A Machine Learning Approach to Predict Watershed Health Indices for Sediments and Nutrients at Ungauged Basins. WATER. MDPI, Basel, Switzerland, 15(3):586, (2023). https://doi.org/10.3390/w15030586

Impact/Purpose:

A novel methodology is presented for estimating watershed health (WH) with respect to water quality (WQ) in ungauged basins. Although the ensemble machine learning model was developed for three major river basins in the Midwest, the methodology is transferable to other watersheds and can be applied to WQ parameters of choice. The lessons learned from the model application can help resource managers and environmental decision makers prioritize WQ-impaired sub-watersheds and target hotspot areas in the Midwest river basins for TMDL development and management actions.

Description:

Effective water quality management and reliable environmental modeling depend on the availability, size, and quality of water quality (WQ) data. Observed stream water quality data are usually sparse in both time and space. Reconstruction of water quality time series using surrogate variables such as streamflow has been used to evaluate risk metrics such as reliability, resilience, vulnerability, and watershed health (WH), but only at gauged locations. Estimating these indices for ungauged watersheds has not been attempted because of the high-dimensional nature of the potential predictor space. In this study, machine learning (ML) models, namely random forest regression, AdaBoost, gradient boosting machines, and Bayesian ridge regression (along with an ensemble model), were evaluated to predict watershed health and other risk metrics at ungauged hydrologic unit code 10 (HUC-10) basins using watershed attributes, long-term climate data, soil data, land use and land cover data, fertilizer sales data, and geographic information as predictor variables. These ML models were tested over the Upper Mississippi River Basin, the Ohio River Basin, and the Maumee River Basin for water quality constituents such as suspended sediment concentration, nitrogen, and phosphorus. Random forest, AdaBoost, and gradient boosting regressors typically showed a coefficient of determination R² > 0.8 for suspended sediment concentration and nitrogen during the testing stage, while the ensemble model exhibited R² > 0.95. Watershed health values with respect to suspended sediments and nitrogen predicted by all ML models, including the ensemble model, were lower for areas with larger agricultural land use, moderate for areas with predominantly urban land use, and higher for forested areas; the trained ML models adequately predicted WH in ungauged basins. However, low WH values (with respect to phosphorus) were predicted at some basins in the Upper Mississippi River Basin that had dominant forest land use. Results suggest that the proposed ML models provide robust estimates at ungauged locations when sufficient training data are available for a WQ constituent. ML models may be used as quick screening tools by decision makers and water quality monitoring agencies for identifying critical source areas or hotspots with respect to different water quality constituents, even for ungauged watersheds.
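
As a rough illustration of the evaluation described above, the sketch below averages per-basin predictions from several models and scores each against held-out observations with the coefficient of determination. It is a minimal sketch, not the authors' implementation: the equal-weight average is an assumption (the record says only that an ensemble of the models was used), the function names `ensemble_average` and `r_squared` are hypothetical, and every number is made up for demonstration.

```python
from statistics import mean


def ensemble_average(predictions_by_model):
    """Equal-weight average of per-basin predictions across models (assumed scheme)."""
    per_basin = zip(*predictions_by_model.values())
    return [mean(values) for values in per_basin]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical watershed health values at five held-out test basins.
observed_wh = [0.20, 0.35, 0.50, 0.65, 0.80]

# Hypothetical test-stage predictions, one list per model family named above.
predictions = {
    "random_forest": [0.25, 0.30, 0.55, 0.60, 0.85],
    "adaboost": [0.15, 0.40, 0.45, 0.70, 0.75],
    "gradient_boosting": [0.22, 0.33, 0.52, 0.63, 0.82],
    "bayesian_ridge": [0.30, 0.30, 0.45, 0.60, 0.70],
}

ensemble_wh = ensemble_average(predictions)

# Report the test-stage score for each model and for the ensemble.
for name, pred in predictions.items():
    print(f"{name}: R2 = {r_squared(observed_wh, pred):.3f}")
print(f"ensemble: R2 = {r_squared(observed_wh, ensemble_wh):.3f}")
```

Averaging can cancel errors that differ in sign across models, which is one plausible reason an ensemble scores higher than its members; the made-up numbers here are not evidence of that effect in the study.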

Record Details:

Record Type: DOCUMENT (JOURNAL/PEER REVIEWED JOURNAL)
Product Published Date: 02/02/2023
Record Last Revised: 06/14/2023
OMB Category: Other
Record ID: 357241